OpenCL (short for Open Computing Language) is an open standard for running compute workloads on many different kinds of compute hardware (e.g. CPUs and GPUs). The OpenCL trademark is held by Apple, and the standard is developed and released by the Khronos Group, a not-for-profit organisation that provides a focal point for the development of royalty-free standards such as OpenGL. The OpenCL specification itself is just a document, and can be downloaded from the Khronos website. It is then the task of compute hardware vendors to produce software implementations of OpenCL that make the best use of their compute devices.
To understand how an OpenCL implementation works, we need to start thinking about hardware. Every compute device, such as a CPU or GPU, has a number of cores on which software can run. In OpenCL terminology these cores are called Compute Units. Each Compute Unit makes available to the operating system a number of hardware threads that can run software. In OpenCL terminology these hardware threads are called Processing Elements. For example, an NVIDIA GP102 die is shown below. Each die contains 30 compute units, outlined by the orange squares. Each compute unit provides 128 processing elements (CUDA cores), so in this example there are $30\times128 = 3840$ processing elements available for use in compute applications.
During execution of an OpenCL program, processing elements each run an instance of a user-specified piece of compiled code called a kernel. Below is an example OpenCL C kernel that takes the absolute value of a single element of an array.
```c
__kernel void vec_fabs(
        // Memory allocations that are on the compute device
        __global float *src,
        __global float *dst,
        // Number of elements in the memory allocations
        int length) {

    // Get our position in the array
    size_t gid0 = get_global_id(0);

    // Take the absolute value of the source element
    if (gid0 < length) {
        dst[gid0] = fabs(src[gid0]);
    }
}
```
We want to run a kernel instance for every element of the array. An OpenCL implementation is a way to run kernel instances on processing elements as they become available. The implementation also provides the means to upload memory to, and download memory from, compute devices. We specify how many kernel instances we want at runtime by defining a 3D execution space called a Grid and specifying its size at kernel launch. Every point in the Grid is called a work-item, and each work-item represents a unique invocation of the kernel. This is much like defining an execution space with nested loops; however, with OpenCL there are no guarantees on the order in which work-items are completed.
Work-items are executed in teams called work-groups. In the example above, the grid has a global size of (10, 8, 2) and each work-group has a size of (5, 4, 1). The number of work-groups in each dimension is then (2, 2, 2). Every work-item has access to device memory it can use exclusively (private memory), memory the team can use (local memory), and memory that other teams use (global and constant memory). Every work-item can query its location within the Grid and use that position as a reference to access memory allocations on the compute device at an appropriately calculated offset.
The above concepts form the core ideas surrounding OpenCL. Everything that follows in this course is supporting information on how to prepare compute devices, memory allocations, kernel invocations, and how best to use these concepts together to get the best performance out of your compute devices.
In every accelerated application there is the concept of a host computer with one or more attached compute devices. The host usually has the largest memory space available and the compute device usually has the most compute power and memory bandwidth. This is why we say the application is "accelerated" by the compute device.
At runtime, the host executes the application and compiles kernels for execution on the compute device. The host manages memory allocations and submits kernels to the compute device for execution. For instances where the compute device is a CPU, the host CPU and the compute device are the same.
Every accelerated application follows the same logical progression of steps:

1. Discover and initialise the compute device/s.
2. Allocate memory on the compute device.
3. Upload input data from the host to the compute device.
4. Run kernels on the compute device.
5. Download results from the compute device back to the host.
6. Release memory allocations and clean up.
We now discuss the OpenCL components that make these steps possible.
Below is a representation of the core software components that are available to an OpenCL application.
The first is the Platform. This is a software representation of the vendor. A platform provides access to all devices that the platform supports. During device discovery, available platforms must be queried before anything else. A platform provides access to one or more compute devices and possibly even a mixture of accelerator devices from the same vendor.
A Device provides a way to query the capabilities of the compute device and provides a foundation on which to build a context.
Surrounding the devices is a Context. A Context is like a registry that keeps track of everything (i.e. kernel executions and memory allocations) happening on the compute device/s. A context is constructed using a platform and one or more devices from that platform. There are some benefits (such as direct memory copies between devices) to encapsulating multiple devices under the same context; however, this requires that the devices belong to the same platform, which may not always be the case. A simpler and more general design is to create a unique context for every compute device.
Within the context are Buffers. Buffers are memory allocations managed under the context, and may exist on either the host or the compute device. At runtime memory is migrated to where it is needed, but you can have some control over where a buffer lives.
At runtime, source code for the kernels is collated into a Program, and the program is compiled for every device in a context. There must be a program for every context, and every program must be compiled with knowledge of the associated devices under the context.
Once a context has been created and devices are known, one can create one or more Command queues for each device. A command queue is a place to submit work, such as kernel invocations and memory copies.
A Kernel is a component of a compiled Program. At runtime we set the arguments of compiled kernels and then submit kernels to command queues for execution. We can keep track of the status of a command submitted to the command queue using an Event.
In summary we have the following components:

* Platform
* Device
* Context
* Buffer
* Program
* Command queue
* Kernel
* Event
From Wikipedia: OpenCL was originally designed by Apple, who developed a proposal to submit to the Khronos Group and holds the trademarks to OpenCL. The first specification, OpenCL 1.0, was ratified on November 18, 2008, and the standard was first publicly released in December 2008. Since then a number of different versions of the standard have been released.
Version 1.1 introduced thread safety, so that calls to most OpenCL functions from different host threads do not introduce race conditions. It also added routines to copy rectangular regions of buffers to and from the host, which is useful when buffers are used to represent 2D and 3D arrays.
Version 1.2 is probably the most significant release of OpenCL. It remained the de facto OpenCL standard for at least 10 years. It added useful abilities such as dividing the compute units of a device into sub-devices that share a common cache, and offline compilation of kernels. Having math operations conform to the IEEE 754 precision standard meant consistent results across heterogeneous compute architectures.
Version 2.0 introduced support for Shared Virtual Memory (SVM). With SVM we no longer needed to qualify which space (i.e. global, local, etc.) a memory allocation belonged to, and memory could be transferred to and from devices transparently to the user. However, this was too much for some vendors to implement, and a few vendor implementations remained at 1.2 for a number of years.
Version 2.1 brought the SPIR-V (Standard Portable Intermediate Representation) language to OpenCL. During compilation an open-source compiler can take C or C++ kernel code and emit a compiled program as SPIR-V intermediate code. At runtime this program is loaded by the application and passed to the vendor driver for further compilation to binary code that can run on the compute device. This was a significant advancement, because if a vendor can implement support for SPIR-V it dramatically reduces the number of intermediate representations the vendor compiler must support. It also offloads support for kernel language advancements to the open-source compiler and provides a measure of security against intellectual property theft.
Version 2.2 allowed kernels to be written using a subset of the C++14 standard. It also updated support for SPIR-V to version 1.2. The combination of shared virtual memory, C++ kernels, and SPIR-V support meant that very few vendors succeeded in producing viable implementations of OpenCL 2.2, and OpenCL stagnated for several years.
Version 3.0 addressed this stagnation by making Version 1.2 the mandatory baseline and all of the Version 2.x improvements optional. This gave vendors freedom to implement what they wanted for their customers and gave the standard some breathing room. Version 3.0 also introduced a new C++ language for kernels (called C++ for OpenCL) that uses a subset of the C++17 standard. The Clang compiler supports compilation of C++ for OpenCL kernels into SPIR-V format.
Below is a summary of major features implemented with each release:
| Specification | Release year | Specifics |
|---|---|---|
| 1.0 | 2008 | Initial release |
| 1.1 | 2010 | Thread safety; copying rectangular regions of buffers |
| 1.2 | 2011 | Sub-devices; offline kernel compilation; IEEE 754-conformant math operations |
| 2.0 | 2013 | Shared Virtual Memory (SVM) |
| 2.1 | 2015 | SPIR-V intermediate representation for kernels |
| 2.2 | 2017 | Kernels in a subset of C++14; SPIR-V 1.2 |
| 3.0 | 2020 | Version 1.2 as the mandatory baseline; Version 2.x features optional; C++ for OpenCL |
All of the major vendors have OpenCL implementations at varying levels of support for the OpenCL specification. The table below shows the latest known level of support for each version of the specification, along with links to the vendor's OpenCL developer page.
| Vendor | 1.2 | 2.0 | 2.1 | 2.2 | 3.0 |
|---|---|---|---|---|---|
| AMD | Y | Y | Y | Some | N |
| Apple | Y | N | N | N | N |
| ARM | Y | Y | Y | N | Y |
| Intel | Y | Y | Y | Some | Y |
| NVIDIA | Y | N | N | N | Y |
| Portable OpenCL | Y | Some | N | N | N |
Apple was the original vendor for OpenCL and it comes baked into the macOS operating system. However, the company has since moved on to its proprietary framework Metal and hasn't invested in OpenCL beyond specification 1.2. Support for OpenCL is built in to NVIDIA's CUDA toolkit, though after an initial flurry of development activity up to version 1.2, development stalled until version 3.0. AMD's support for OpenCL is part of the ROCm suite. Intel strongly supports OpenCL development for CPUs and GPUs with its oneAPI toolkit; the CPU implementation also works for AMD CPUs, which is really good! ARM has solid support for OpenCL on its Mali GPUs. The open-source POCL (Portable OpenCL) implementation has a CPU implementation as well as support for OpenCL on CUDA and OpenCL on macOS.
A conformant OpenCL implementation is an implementation of OpenCL that has passed Khronos' test suite. The number of vendors with conformant implementations is an evolving list; see the Khronos conformance page for the latest conformant implementations.
The best source of help for OpenCL is the Khronos OpenCL registry. There you can find excellent documentation on the latest specification that your vendor supports. As an exercise, download the latest API specification in PDF format and have it ready as reference material.
Download from the Khronos OpenCL registry the latest OpenCL API and C language specifications to your computer.
This is sometimes a difficult question to answer. Researchers often have diverse computing environments; in such cases OpenCL is a good fit, as it provides a solid foundation for your research tools. However, if you are looking for the best possible performance and can live with vendor lock-in, then vendor-specific tools will help with that.
Drawbacks to using OpenCL
Benefits of using OpenCL
Just to avoid confusion, there are two compilation steps for OpenCL applications:

1. Compiling the host application and linking it against an OpenCL library.
2. At runtime, compiling kernels into programs for each compute device.
During program execution (Step 2), kernels are combined into programs and the programs are compiled for each compute device using the vendor's kernel compiler. Thankfully, when compiling an OpenCL application prior to execution (Step 1), we don't need to link against every available implementation. We just need to link against a single library called the Installable Client Driver (ICD) loader, which may be provided by any vendor. The ICD loader is named OpenCL.dll on Windows and libOpenCL.so on Linux. Accompanying the ICD loader are header files (CL/opencl.h for C and CL/cl.hpp for C++) that must be included from the C/C++ source code. The ICD loader takes care of intercepting all OpenCL library calls and routing them to the appropriate vendor implementation. The routing process happens transparently to the user.
The file hello_devices.cpp contains a complete OpenCL application that obtains the size of on-device memory and the maximum Buffer size that is possible within that memory.
```bash
cd src/L1_Introduction
```
| File | Directory |
|---|---|
| ICD loader (libOpenCL.so) | /usr/local/cuda/lib64 |
| OpenCL C++ headers directory (CL) | /usr/local/cuda/include |
In the Terminal use ls to list the contents of these directories and locate the CL directory in which the OpenCL header files are located.
```bash
g++ -g -O2 -I/usr/include -I../include -L/usr/lib64 hello_devices.cpp \
    -o hello_devices.exe -lOpenCL
./hello_devices.exe
```
You should see at least one device printed with the name and memory sizes. Now that you know how to let the compiler know about OpenCL you can use the make command within that directory to compile the example.
```bash
!make clean; make
```
```
rm -r *.exe
g++ -std=c++11 -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib64 hello_devices.cpp \
    -o hello_devices.exe -lOpenCL -lomp
In file included from hello_devices.cpp:2:0:
../include/cl_helper.hpp: In function ‘_cl_command_queue** h_create_command_queues(_cl_device_id**, _cl_context**, cl_uint, cl_uint, cl_bool, cl_bool)’:
../include/cl_helper.hpp:315:9: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
         );
         ^
In file included from /usr/include/CL/opencl.h:24:0,
                 from ../include/cl_helper.hpp:15,
                 from hello_devices.cpp:2:
/usr/include/CL/cl.h:1906:1: note: declared here
 clCreateCommandQueue(cl_context context,
 ^~~~~~~~~~~~~~~~~~~~
```
This application is rather rudimentary; there is a far more sophisticated OpenCL query application called clinfo. You can use it to query a great deal of information about the available devices. Here we use clinfo to query the available platforms and devices.
```bash
!clinfo -l
```
```
Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
 `-- Device #0: Intel(R) FPGA Emulation Device
Platform #1: Intel(R) OpenCL
 `-- Device #0: AMD Ryzen 7 6800H with Radeon Graphics
Platform #2: AMD Accelerated Parallel Processing
Platform #3: NVIDIA CUDA
 `-- Device #0: NVIDIA GeForce RTX 3060 Laptop GPU
```